57 research outputs found

    Understanding High Dimensional Spaces through Visual Means Employing Multidimensional Projections

    Data visualisation helps in understanding data represented by multiple variables, also called features, stored in a large matrix where individuals are stored in rows and variable values in columns. These data structures are frequently called multidimensional spaces. In this paper, we illustrate ways of employing the visual results of multidimensional projection algorithms to understand and fine-tune the parameters of their mathematical framework. Some of the mathematical constructs common to these approaches are Laplacian matrices, Euclidean distance, cosine distance, and statistical methods such as the Kullback-Leibler divergence, employed to fit probability distributions and reduce dimensions. Two of the relevant algorithms in the data visualisation field are t-distributed stochastic neighbourhood embedding (t-SNE) and Least-Square Projection (LSP). These algorithms can be used to understand several ranges of mathematical functions, including their impact on datasets. In this article, mathematical parameters of underlying techniques, such as Principal Component Analysis (PCA) behind t-SNE and mesh reconstruction methods behind LSP, are adjusted to reflect the properties afforded by the mathematical formulation. The results, supported by illustrations of the processes of LSP and t-SNE, are meant to inspire students to understand the mathematics behind such methods and apply them in effective data analysis tasks across multiple applications.
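
    The abstract mentions PCA as an underlying technique behind t-SNE; PCA is indeed commonly used to initialise t-SNE's low-dimensional embedding. As a minimal, self-contained sketch of that building block (the function name `pca_project` and the toy data are illustrative, not from the paper):

    ```python
    import numpy as np

    def pca_project(X, n_components=2):
        """Project rows of X onto the top principal components,
        via SVD of the mean-centred data matrix."""
        Xc = X - X.mean(axis=0)                 # centre each feature (column)
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:n_components].T         # coordinates in the PC basis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))              # 100 individuals, 10 features
    Y = pca_project(X)                          # 2-D view, e.g. to seed t-SNE
    print(Y.shape)                              # (100, 2)
    ```

    Because the singular values are returned in descending order, the first output coordinate always captures at least as much variance as the second.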

    LDPP at the FinNLP-2022 ERAI task: Determinantal point processes and variational auto-encoders for identifying high-quality opinions from a pool of social media posts

    Social media and online forums have made it easier for people to share their views and opinions on various topics in society. In this paper, we focus on posts discussing investment-related topics. When it comes to investment, people can now easily share their opinions about online traded items and also provide rationales to support their arguments on social media. However, there are millions of posts to read, some potentially written by amateur investors or completely unrelated to the topic. Identifying the most important posts, those that could lead to a higher maximal potential profit (MPP) and a lower maximal loss for investment, is not a trivial task. In this paper, we propose to use determinantal point processes and variational autoencoders to identify high-quality posts from the given rationales. Experimental results suggest that our method mines higher-quality posts than random selection, and that latent variable modeling further improves the quality of the selected posts.
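
    The determinantal point process and variational autoencoder specifics are the paper's own. Purely as an illustration of the selection mechanism, a greedy MAP sketch for a DPP over a hypothetical quality-times-similarity kernel (the toy posts and quality scores below are assumptions) might look like:

    ```python
    import numpy as np

    def greedy_dpp_select(L, k):
        """Greedy MAP inference for a DPP: at each step, add the item that
        maximises the log-determinant of the selected kernel submatrix."""
        selected, remaining = [], list(range(L.shape[0]))
        for _ in range(k):
            best, best_ld = None, -np.inf
            for i in remaining:
                idx = np.ix_(selected + [i], selected + [i])
                ld = np.linalg.slogdet(L[idx])[1]
                if ld > best_ld:
                    best, best_ld = i, ld
            selected.append(best)
            remaining.remove(best)
        return selected

    # Toy post embeddings: posts 0 and 1 are near-duplicates, post 2 differs.
    E = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    q = np.array([1.0, 1.0, 0.8])               # per-post quality scores
    L = np.outer(q, q) * (E @ E.T)              # quality-diversity kernel
    print(greedy_dpp_select(L, 2))              # [0, 2]: skips the duplicate
    ```

    The determinant rewards both high per-item quality (large diagonal entries) and diversity (small off-diagonal similarity), which is why the near-duplicate post 1 is passed over despite its high quality score.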

    The role of habitat features in a primary succession

    In order to determine the role of habitat features in a primary succession on lava domes of Terceira Island (Azores), we addressed the following questions: (1) Is the rate of cover development related to environmental stress? (2) Do succession rates differ as a result of habitat differences? One transect, intercepting several habitat types (rocky hummocks, hollows and pits, small and large fissures), was established from the slope to the summit of a 247-year-old dome. Data on floristic composition, vegetation bioarea, structure, demography and soil nutrients were collected. Quantitative and qualitative similarities among habitats were also analyzed. Cover development and species accumulation are mainly dependent on habitat features. Habitat features play a critical role in determining the rate of succession by providing different environmental conditions that enable different rates of colonization and cover development. Since the slope's surface is composed of hummocks, hollows and pits, the low succession rates in these habitats are responsible for the lower rates of succession in this geomorphologic unit, whereas the presence of fissures at the dome's summit accelerates its succession rate.

    Visual analysis of interactive document clustering streams

    Interactive clustering techniques play a key role by putting the user in the clustering loop, allowing them to interact with document group abstractions instead of full-length documents. This lets users approach corpus exploration as an incremental task. To explore the incremental aspect of information discovery, this article proposes a visual component that depicts clustering membership changes throughout a clustering iteration loop, for both static and dynamic data sets. The visual component is evaluated with an expert user and in an experiment with data streams.

    GGNN@Causal News Corpus 2022: Gated graph neural networks for causal event classification from social-political news articles

    Causality is a core cognitive concept, and the discovery of causal mentions in text appears in many natural language processing (NLP) applications. In this paper, we study the task of Event Causality Identification (ECI) from social-political news. The aim of the task is to detect causal relationships between event mention pairs in text. Although deep learning models have recently achieved state-of-the-art performance on many NLP tasks and applications, most of them still fail to capture the rich semantic and syntactic structures within sentences that are key for causality classification. We present a solution for causal event detection from social-political news that captures semantic and syntactic information based on gated graph neural networks (GGNN) and contextualized language embeddings. Experimental results show that our proposed method outperforms the baseline model (BERT, Bidirectional Encoder Representations from Transformers) in terms of F1-score and accuracy.
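
    The paper's model combines GGNNs with contextualized embeddings; the snippet below only sketches the generic gated (GRU-style) propagation step that gives GGNNs their name, with made-up node states and weight shapes rather than the paper's architecture:

    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def ggnn_step(H, A, Wz, Uz, Wr, Ur, Wh, Uh):
        """One GGNN propagation step: aggregate neighbour states through the
        adjacency matrix A, then update node states with GRU-style gates."""
        M = A @ H                               # messages from neighbours
        z = sigmoid(M @ Wz + H @ Uz)            # update gate
        r = sigmoid(M @ Wr + H @ Ur)            # reset gate
        Ht = np.tanh(M @ Wh + (r * H) @ Uh)     # candidate state
        return (1 - z) * H + z * Ht

    rng = np.random.default_rng(1)
    n, d = 4, 8                                 # 4 token nodes, state size 8
    A = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
                  [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
    H = rng.normal(size=(n, d))
    W = [rng.normal(scale=0.1, size=(d, d)) for _ in range(6)]
    H1 = ggnn_step(H, A, *W)
    print(H1.shape)                             # (4, 8)
    ```

    Stacking several such steps lets information flow along syntactic edges, which is how graph-based models capture sentence structure that a flat sequence model may miss.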

    Explaining neighborhood preservation for multidimensional projections

    Dimensionality reduction techniques are the tools of choice for exploring high-dimensional datasets by means of low-dimensional projections. However, even state-of-the-art projection methods fail, to various degrees, to perfectly preserve the structure of the data, expressed in terms of inter-point distances and point neighborhoods. To support better interpretation of a projection, we propose several metrics for quantifying errors related to neighborhood preservation. Next, we propose a number of visualizations that allow users to explore and explain the quality of neighborhood preservation at different scales, as captured by the aforementioned error metrics. We demonstrate our exploratory views on three real-world datasets and two state-of-the-art multidimensional projection techniques.
    Funding: São Paulo Research Foundation (FAPESP) grant 2012/07722-9; CAPES–NUFFIC 028/1
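
    As one concrete neighborhood-preservation metric (illustrative, not necessarily the paper's exact definition), the Jaccard overlap between each point's k-nearest-neighbour set before and after projection can be computed as:

    ```python
    import numpy as np

    def knn_sets(X, k):
        """k-nearest-neighbour index set for each row of X (brute force)."""
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(D, np.inf)             # a point is not its own neighbour
        return [set(np.argsort(row)[:k]) for row in D]

    def neighborhood_preservation(X_high, X_low, k=5):
        """Mean Jaccard overlap of k-NN sets in the original space vs. the
        projection; 1.0 means neighbourhoods are perfectly preserved."""
        hi, lo = knn_sets(X_high, k), knn_sets(X_low, k)
        return float(np.mean([len(h & l) / len(h | l) for h, l in zip(hi, lo)]))

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 8))
    print(neighborhood_preservation(X, 2.0 * X))   # 1.0: scaling keeps neighbours
    ```

    Varying k probes preservation at different scales, which is the per-scale view the abstract describes.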

    UCCNLP@SMM4H’22:Label distribution aware long-tailed learning with post-hoc posterior calibration applied to text classification

    The paper describes our submissions to the Social Media Mining for Health (SMM4H) workshop 2022 shared tasks. We participated in two tasks: (1) classification of adverse drug event (ADE) mentions in English tweets (Task 1a) and (2) classification of self-reported intimate partner violence (IPV) on Twitter (Task 7). We propose an approach that uses RoBERTa (A Robustly Optimized BERT Pretraining Approach) fine-tuned with a label-distribution-aware margin loss function and post-hoc posterior calibration for robust inference against class imbalance. We achieved a 4% and 1% increase in performance on IPV and ADE, respectively, compared with a traditional fine-tuning strategy using unweighted cross-entropy loss.
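
    The paper's exact calibration procedure is its own; as a hedged sketch of the common logit-adjustment form of post-hoc posterior calibration (function name, priors, and numbers below are illustrative), rare-class logits are boosted by the negative log of their training prior:

    ```python
    import numpy as np

    def calibrate_logits(logits, class_priors, tau=1.0):
        """Post-hoc posterior calibration: subtract tau * log(prior) from each
        class logit so rare classes are not penalised at inference time."""
        return np.asarray(logits) - tau * np.log(np.asarray(class_priors))

    # Binary toy case: class 1 (e.g. ADE-positive tweets) is rare in training.
    logits = np.array([2.0, 1.5])               # raw scores favour class 0
    priors = np.array([0.9, 0.1])               # long-tailed label distribution
    print(int(np.argmax(logits)))               # 0
    print(int(np.argmax(calibrate_logits(logits, priors))))  # 1
    ```

    The adjustment flips the decision toward the rare class exactly when the raw margin is smaller than the log-prior gap, which is why it helps under class imbalance without retraining.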

    UNLPSat TextGraphs-16 Natural Language Premise Selection task: Unsupervised Natural Language Premise Selection in mathematical text using sentence-MPNet

    This paper describes our system submitted to the TextGraphs 2022 shared task at COLING 2022: Natural Language Premise Selection (NLPS) from mathematical texts. The NLPS task is to select mathematical statements, called premises, from a knowledge base written in natural language and mathematical formulae, that are most likely to be useful in a particular mathematical proof. We formulated this as an unsupervised semantic similarity task by first obtaining contextualized embeddings of both the premises and the mathematical proofs using sentence transformers. We then computed the cosine similarity between premise and proof embeddings and selected the premises with the highest cosine scores as the most probable. Our system improves over the baseline, which uses bag-of-words models based on term frequency-inverse document frequency (TF-IDF), by about 23.5% in mean average precision (MAP) (0.1516 versus 0.1228).
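
    The ranking step described above can be sketched as follows; in the actual system the embeddings would come from a sentence-MPNet model, so the toy vectors here are stand-ins:

    ```python
    import numpy as np

    def rank_premises(proof_emb, premise_embs, top_k=2):
        """Return premise indices sorted by cosine similarity to the proof."""
        P = premise_embs / np.linalg.norm(premise_embs, axis=1, keepdims=True)
        q = proof_emb / np.linalg.norm(proof_emb)
        scores = P @ q                          # cosine similarities
        order = np.argsort(-scores)[:top_k]
        return order.tolist()

    proof = np.array([1.0, 0.2, 0.0])           # embedding of the proof text
    premises = np.array([[0.9, 0.1, 0.0],       # closely related premise
                         [0.0, 0.0, 1.0],       # unrelated premise
                         [0.7, 0.6, 0.1]])      # somewhat related premise
    print(rank_premises(proof, premises))       # [0, 2]
    ```

    Because both sides are L2-normalised, the dot product equals the cosine similarity, so the whole ranking reduces to one matrix-vector product.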